Indonesian Journal of Electrical Engineering and Computer Science 
Vol. 12, No. 1, October 2018, pp. 46~50 
ISSN: 2502-4752, DOI: 10.11591/ijeecs.v12.i1.pp46-50


Random Forest Approach for Sentiment Analysis in Indonesian
Language


M. Ali Fauzi 


Faculty of Computer Science, Brawijaya University, Malang, Indonesia 


Article Info

Article history:
Received May 5, 2018
Revised Jul 6, 2018
Accepted Jul 10, 2018

Keywords:
Random forest
Sentiment analysis
Term weighting
Text classification
TF.IDF

ABSTRACT

Sentiment analysis has become very useful since the rise of social media and online review websites,
which created the need to analyze their sentiment in an effective and efficient way. We can consider
sentiment analysis as a text classification problem with sentiments as its categories. In this study, we
explore the use of Random Forest for sentiment classification in the Indonesian language. We also explore
the use of bag-of-words (BOW) features with several term weighting variations: Binary TF, Raw TF,
Logarithmic TF and TF.IDF. The experimental results show that a sentiment analysis system using Random
Forest gives good performance, with an average OOB score of 0.829. The results also show that all four term
weighting methods give competitive results. Since the score differences are small, we can say that the term
weighting variations in this study have no remarkable effect on sentiment analysis using Random Forest.


Copyright © 2018 Institute of Advanced Engineering and Science. 
All rights reserved. 


Corresponding Author: 


M. Ali Fauzi, 

Faculty of Computer Science, 

Brawijaya University, Malang, Indonesia.
Email: moch.ali.fauzi@ub.ac.id 


1. INTRODUCTION

Nowadays, people tend to write about their experiences, feelings, opinions, and views on events, products,
or services on online platforms such as social media, blogs, forums, shopping sites, and review sites. This
makes online platforms a highly valuable source of information for both consumers and producers.
Customers can get second opinions before purchasing products or services. Producers, in turn, learn
what people think about their products or services and can predict the level of public acceptance.
This information can be very useful for improvement and marketing strategies [1].

Sentiment analysis is the task of analyzing people’s opinions in a piece of text in order to determine
whether the sentiment is positive, negative, or neutral. Sentiment analysis has gained popularity
over the past years as a result of the rise of social media and online review websites and the resulting need
to analyze their sentiment in an effective and efficient way. It is currently a major research field
with applications in many domains, such as election result prediction [2]-[4], stock market
prediction [5], [6], product and merchant ranking [7], movie revenue prediction [8]-[10], and learning
evaluation [11], [12].

We can consider sentiment analysis as a text classification problem with sentiments as its categories.
Therefore, we can use supervised machine learning approaches to tackle this problem. This approach is very
popular in sentiment analysis and has proven to be effective in this field. Machine learning approaches that
have been used in this field include Naive Bayes [13]-[17], Support Vector Machines [18]-[19],
Maximum Entropy [20], Neural Networks [21], [22], decision trees, and K-Nearest Neighbor (KNN) [23]-[26].

In this study, we explore the use of Random Forest for sentiment classification in the Indonesian
language. Random Forest is an ensemble learning technique based on the decision tree algorithm [27].
Random Forests have attracted considerable attention in recent years because their performance has
surpassed that of SVMs, Naive Bayes, and other machine learning algorithms for classification tasks in some
domains, such as bioinformatics and computational biology [28]. We examine whether this type of ensemble
method remains outstanding on sentiment analysis tasks. We also explore the use of bag-of-words (BOW)
features with several term weighting variations: Binary TF, Raw TF, Logarithmic TF, and TF.IDF.


2. RESEARCH METHOD 

As depicted in Figure 1, the sentiment analysis system in this study consists of three main stages:
preprocessing, feature extraction, and classification using Random Forest. The output of the classification
is one of two categories, positive or negative.

Figure 1. System main flowchart


2.1. Preprocessing 

The first stage of the system is preprocessing. This stage involves several processes, including
tokenization, case folding, and cleaning. Tokenization is the task of splitting the review text into smaller
units called tokens or terms [29], [30]. Case folding is the task of converting all characters in the review
text to lowercase [31], [32]. Cleaning is the task of removing punctuation, numbers, HTML tags, and
characters outside the alphabet. In this study, we do not employ stemming or filtering, since in some
previous works on sentiment analysis stemming and filtering did not improve classification performance.
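
The three preprocessing steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the study: the function name `preprocess`, the regular expressions, whitespace tokenization, and the order in which the steps are applied are all choices made for this example.

```python
import re

# Assumption: a crude pattern is enough to strip HTML tags in review text.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: after lowercasing, "the alphabet" means the letters a-z.
NON_ALPHA_RE = re.compile(r"[^a-z\s]")


def preprocess(review):
    """Return the list of tokens for one review (hypothetical helper)."""
    text = TAG_RE.sub(" ", review)       # cleaning: remove HTML tags
    text = text.lower()                  # case folding
    text = NON_ALPHA_RE.sub(" ", text)   # cleaning: punctuation, numbers, other symbols
    return text.split()                  # tokenization on whitespace


tokens = preprocess("<p>Produk ini BAGUS banget!!! 10/10</p>")
print(tokens)
```

No stemming or stop-word filtering is applied, in line with the choice described above.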


2.2. Feature Extraction 

Bag-of-words (BOW) features are used in this study. Each document is represented as a
vector in a term space whose features are the unique terms from the preprocessing stage. The feature
vector values are determined using a term weighting method. The most popular term weighting methods are
Term Frequency (TF), Inverse Document Frequency (IDF), and the combination of the two, Term Frequency
Inverse Document Frequency (TF.IDF) [33].

Term Frequency assigns weights by assuming that each term has a contribution that is
proportional to the number of its occurrences in the document [34], [35]. There are several popular
variations of TF, such as Binary TF, Raw TF, and Logarithmic TF. Using Binary TF, each document is
represented as a binary vector. A term that occurs in a document gets the value 1 in the document vector,
while a term that never occurs in the document gets the value 0. This kind of term weighting does not
consider the number of term occurrences, only 0/1 values. In contrast to Binary TF, the Raw TF method does
consider the number of term occurrences: a term gets a value equal to how many times it appears in the
document. Logarithmic TF also considers the number of term occurrences. The difference is that Logarithmic
TF assumes that the importance of a term in a document does not increase proportionally with how many
times it occurs. The weight of term t in document d using Logarithmic TF can be computed as follows:


$$\mathrm{TF}(t,d) = 1 + \log(f_{t,d}) \qquad (1)$$


where $f_{t,d}$ is the number of times term t appears in document d.
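
To make the three TF variants concrete, the following sketch computes a document vector under each scheme. It rests on assumptions that the text does not specify: the logarithm is taken in base 10, a term that does not occur gets weight 0 under Logarithmic TF, and the names `weight_document` and `SCHEMES` as well as the toy vocabulary are hypothetical.

```python
import math
from collections import Counter


def binary_tf(count):
    """Binary TF: 1 if the term occurs at all, otherwise 0."""
    return 1 if count > 0 else 0


def raw_tf(count):
    """Raw TF: the number of occurrences itself."""
    return count


def log_tf(count):
    """Logarithmic TF as in Eq. (1): 1 + log(f).

    Assumptions: base-10 logarithm, and weight 0 for an absent term.
    """
    return 1 + math.log10(count) if count > 0 else 0.0


SCHEMES = {"binary": binary_tf, "raw": raw_tf, "log": log_tf}


def weight_document(tokens, vocabulary, scheme):
    """Return the weight vector of one document over a fixed vocabulary."""
    counts = Counter(tokens)  # missing terms count as 0
    weigh = SCHEMES[scheme]
    return [weigh(counts[term]) for term in vocabulary]


vocabulary = ["bagus", "jelek", "murah"]   # illustrative terms only
tokens = ["bagus", "murah", "bagus"]
for name in SCHEMES:
    print(name, weight_document(tokens, vocabulary, name))
```

With this toy document, Raw TF gives the repeated term twice the weight of the term that occurs once, while Logarithmic TF gives it only about 1.3 times the weight.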


Meanwhile, Inverse Document Frequency is a global term weighting computed from
the distribution of the term in the dataset. This term weighting gives a higher value to a rare term, that is,
a term that only appears in a few documents. The weight of term t using IDF is formulated as follows:





$$\mathrm{IDF}(t) = 1 + \log\left(\frac{N_d}{df_t}\right) \qquad (2)$$


where $N_d$ is the number of documents in the dataset and $df_t$ is the number of documents in the dataset in
which term t appears.

The most popular term weighting is TF.IDF. TF.IDF is the product of TF and IDF. The combined
weight of term t in document d can be computed as follows [36]:


$$\mathrm{TF.IDF}(t,d) = \mathrm{TF}(t,d) \cdot \mathrm{IDF}(t) \qquad (3)$$

where $\mathrm{TF}(t,d)$ is the TF value of term t in document d and $\mathrm{IDF}(t)$ is the IDF value of term t.
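
The IDF and TF.IDF weights can be sketched as follows. Several points are assumptions: Eq. (2) is read as IDF(t) = 1 + log(N_d / df_t), the logarithm is taken in base 10, the TF factor in Eq. (3) is taken to be the Logarithmic TF of Eq. (1), and the function names and toy corpus are hypothetical.

```python
import math


def idf(term, documents):
    """IDF(t) = 1 + log10(N_d / df_t), where documents is a list of token lists."""
    n_docs = len(documents)
    df = sum(1 for doc in documents if term in doc)  # documents containing the term
    if df == 0:
        raise ValueError(f"term {term!r} does not occur in the dataset")
    return 1 + math.log10(n_docs / df)


def tf_idf(term, doc, documents):
    """TF.IDF(t, d) = TF(t, d) * IDF(t), using Logarithmic TF (assumption)."""
    count = doc.count(term)
    tf = 1 + math.log10(count) if count > 0 else 0.0
    return tf * idf(term, documents)


# Toy corpus: "bagus" is rare (1 of 10 documents), "jelek" is common (9 of 10).
documents = [["bagus"]] + [["jelek"] for _ in range(9)]
print(idf("bagus", documents), idf("jelek", documents))
print(tf_idf("bagus", documents[0], documents))
```

The rare term receives the larger IDF, which is the behaviour described above: terms concentrated in a few documents are weighted up relative to terms that appear almost everywhere.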


2.3. Sentiment Classification using Random Forest 

The last stage is sentiment classification. Each review is classified into the positive or negative
category. In this study, we employ Random Forest for the classification task. Random Forest is a
supervised classification algorithm. It is an ensemble learning technique based on the decision tree
algorithm [27]. This ensemble technique combines the predictions of several base estimators built with the
decision tree algorithm to improve robustness over an individual estimator. Random Forest grows many
classification trees, which together are called a forest. To classify a new sample, each tree gives its
category prediction as one vote, and the forest chooses the category with the majority of votes. In general,
the more trees in the forest, the higher the accuracy tends to be.
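
The voting step can be illustrated with a minimal sketch. The "trees" here are hand-written stand-ins (hypothetical one-rule classifiers over term counts), not trained decision trees, and `forest_predict` is a name chosen for this example; the sketch shows only how individual votes are combined.

```python
from collections import Counter


def forest_predict(trees, features):
    """Collect one vote per tree and return the majority category."""
    votes = [tree(features) for tree in trees]
    label, _count = Counter(votes).most_common(1)[0]
    return label


# Hypothetical stand-ins for trained trees; each looks at a single term count.
toy_trees = [
    lambda f: "positive" if f.get("bagus", 0) > 0 else "negative",
    lambda f: "negative" if f.get("jelek", 0) > 0 else "positive",
    lambda f: "positive" if f.get("suka", 0) > 0 else "negative",
]

print(forest_predict(toy_trees, {"bagus": 2}))
print(forest_predict(toy_trees, {"jelek": 1}))
```

Using an odd number of trees avoids ties in a two-category problem such as this one.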

Random Forests have been gaining popularity in recent years because they have shown outstanding
performance on classification tasks in some domains, such as bioinformatics and computational
biology. There are also some works on text classification using Random Forest, such as hate speech
detection [37] and authorship profiling [38].


3. RESULTS AND ANALYSIS 

The experiment was conducted using 386 reviews taken from FemaleDaily. All of the reviews are in the
Indonesian language. Instead of cross-validation, Random Forest uses the out-of-bag (OOB) error estimate
to obtain an unbiased estimate of classification performance. The OOB score ranges from 0 to 1. A higher
OOB score indicates better classification performance, and a lower OOB score indicates worse classification
performance. In the experiment, Random Forest was tested using several term weighting methods:
Binary TF, Raw TF, Logarithmic TF, and TF.IDF. The experiment was conducted using the Scikit-learn
library [39]. The results can be seen in Figure 2.
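
The idea behind the OOB estimate can be sketched as follows. Each tree is trained on a bootstrap sample of the reviews, so the reviews that were not drawn are "out of bag" for that tree and can serve as test data for it. This is a simplified illustration, not Scikit-learn's implementation: the function names, the toy labels, and the hand-written per-tree predictions are all hypothetical.

```python
import random
from collections import Counter


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag


def oob_score(labels, tree_predictions, oob_sets):
    """Accuracy of majority votes cast only by trees that did not train on the sample.

    tree_predictions[k][i] is tree k's prediction for sample i;
    oob_sets[k] is the set of sample indices that are out of bag for tree k.
    """
    correct = 0
    scored = 0
    for i, true_label in enumerate(labels):
        votes = [preds[i] for preds, oob in zip(tree_predictions, oob_sets) if i in oob]
        if not votes:
            continue  # the sample was in the bag of every tree, so it is skipped
        majority = Counter(votes).most_common(1)[0][0]
        scored += 1
        correct += int(majority == true_label)
    return correct / scored if scored else float("nan")


# Bootstrap over 386 items, the dataset size reported above.
in_bag, out_of_bag = bootstrap_split(386, random.Random(0))
print(len(in_bag), len(out_of_bag))

# Toy example with 4 samples and 3 trees (all values are made up).
labels = ["pos", "neg", "pos", "neg"]
tree_predictions = [
    ["pos", "neg", "neg", "neg"],
    ["neg", "neg", "pos", "pos"],
    ["pos", "pos", "pos", "pos"],
]
oob_sets = [{0, 1}, {1, 2}, {0, 3}]
print(oob_score(labels, tree_predictions, oob_sets))
```

Because every sample is scored only by trees that never saw it during training, the estimate plays a role similar to a held-out test set without requiring a separate split.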

Figure 2. Sentiment analysis experiment result using random forest

Figure 2 shows that sentiment analysis using Random Forest gives good performance, with an average OOB
score of 0.829. We can also see from Figure 2 that all four term weighting methods give competitive results;
the OOB scores differ only slightly. The best OOB score, 0.837, is obtained by Raw TF. Binary TF is in
second place with an OOB score of 0.829, and TF.IDF is third with an OOB score of 0.828. The lowest OOB
score, 0.821, is obtained by Logarithmic TF. This result is somewhat surprising because TF.IDF usually
outperforms other term weighting methods. However, since the gap between the best and worst scores is
only 0.016, we can say that the term weighting variations in this study have no remarkable effect on
sentiment analysis using Random Forest.
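
The summary figures can be checked directly from the four reported scores. The snippet below uses only the OOB scores stated in the text; the variable names are chosen for this example.

```python
# OOB scores as reported in the text for each term weighting method.
oob_scores = {
    "Binary TF": 0.829,
    "Raw TF": 0.837,
    "Logarithmic TF": 0.821,
    "TF.IDF": 0.828,
}

average = sum(oob_scores.values()) / len(oob_scores)
best = max(oob_scores, key=oob_scores.get)
worst = min(oob_scores, key=oob_scores.get)
spread = oob_scores[best] - oob_scores[worst]

print(f"average={average:.5f} best={best} worst={worst} spread={spread:.3f}")
```

The mean of the four scores is 0.82875, consistent with the reported average of 0.829, and the gap between the best and worst methods is 0.016.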


4. CONCLUSION 

In this study, we explored Random Forest with several term weighting methods for sentiment analysis in the
Indonesian language. The system consists of three main stages: preprocessing, feature extraction, and
classification using Random Forest. The output of the classification is one of two categories,
positive or negative. The experimental results show that sentiment analysis using Random Forest gives good
performance, with an average OOB score of 0.829. The results also show that all four term weighting
methods give competitive results. Since the score differences are small, we can say that the term
weighting variations in this study have no remarkable effect on sentiment analysis using Random Forest.